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Abstract 

Given a sample covariance matrix, we examine the problem of maximizing the variance explained 
by a linear combination of the input variables while constraining the number of nonzero coefficients 
in this combination. This is known as sparse principal component analysis and has a wide array of 
applications in machine learning and engineering. Unfortunately, this problem is also combinatorially 
hard and we discuss convex relaxation techniques that efficiently produce good approximate solutions. 
We then describe several algorithms solving these relaxations as well as greedy algorithms that iteratively 
improve the solution quality. Finally, we illustrate sparse PCA in several applications, ranging from 
senate voting and finance to news data. 

1 Introduction 

Principal component analysis (PCA) is a classical tool for data analysis, visualization and dimensionality 
reduction and has a wide range of applications throughout science and engineering. Starting from a multi- 
variate data set, PCA finds linear combinations of the variables called principal components, corresponding 
to orthogonal directions maximizing variance in the data. Numerically, a full PCA involves a singular value 
decomposition of the data matrix. 

One of the key shortcomings of PCA is that the factors are linear combinations of all original variables; 
that is, most of factor coefficients (or loadings) are non-zero. This means that while PCA facilitates model 
interpretation and visualization by concentrating the information in a few factors, the factors themselves are 
still constructed using all variables, hence are often hard to interpret. 

In many applications, the coordinate axes involved in the factors have a direct physical interpretation. 
In financial or biological applications, each axis might correspond to a specific asset or gene. In problems 
such as these, it is natural to seek a trade-off between the two goals of statistical fidelity (explaining most 
of the variance in the data) and interpretability (making sure that the factors involve only a few coordinate 
axes). Solutions that have only a few nonzero coefficients in the principal components are usually easier to 
interpret. Moreover, in some applications, nonzero coefficients have a direct cost (e.g., transaction costs in 
finance) hence there may be a direct trade-off between statistical fidelity and practicality. Our aim here is to 
efficiently derive sparse principal components, i.e, a set of sparse vectors that explain a maximum amount 
of variance. Our motivation is that in many applications, the decrease in statistical fidelity required to obtain 
sparse factors is small and relatively benign. 

In what follows, we will focus on the problem of finding sparse factors which explain a maximum 
amount of variance in the original data, which can be written 

max z T Sz-pCard(z) (1) 
llz||<i 
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in the variable z G R n , where S £ S n is the (symmetric positive semi-definite) sample covariance matrix, p 
is a parameter controlling sparsity, and Card(,z) denotes the cardinal (or £q norm) of z, i.e. the number of 
non zero coefficients of z. 

While PCA is numerically easy, each factor requires computing a leading eigenvector, which can be 
done in 0(n 2 ) floating point operations using the Lanczos method for example (see e.g. (Golub and 
Van Loan, 1990, §8.3, §9.1.1) or Saad (1992) for details), sparse PCA is a hard combinatorial problem. 
In fact, Moghaddam et al. (2006a) show that the subset selection problem for ordinary least squares, which 
is NP-hard Natarajan (1995), can be reduced to a sparse generalized eigenvalue problem, of which sparse 
PCA is a particular instance. Sometimes factor rotation techniques are used to post-process the results 
from PCA and improve interpretability (see QUARTIMAX by Neuhaus and Wrigley (1954), VARIMAX by 
Kaiser (1958) or Jolliffe (1995) for a discussion). Results by e.g. Amini and Wainwright (2009) show that 
this naive approach has significantly worst convergence rates than the relaxations we present here. Another 
straightforward solution is to threshold to zero loadings with small magnitude Cadima and Jolliffe (1995), 
but outside of easy cases, the methods highlighted below always perform better in situation when only a few 
observations are available or when significant noise is present. 

A more systematic approach to the problem arose in recent years, with various researchers proposing 
nonconvex algorithms (e.g., SCoTLASS by Jolliffe et al. (2003), SLRA by Zhang et al. (2002) or D.C. 
based methods Sriperumbudur et al. (2007) which find modified principal components with zero loadings. 
The SPCA algorithm, which is based on the representation of PCA as a regression-type optimization prob- 
lem Zou et al. (2006), allows the application of the LASSO Tibshirani (1996), a penalization technique 
based on the t\ norm. With the exception of simple thresholding, all the algorithms above require solving 
non convex problems. Recently also, d'Aspremont et al. (2007) derived an i\ based semidefinite relaxation 
for the sparse PCA problem (1) with a complexity of 0(n 4 \/logn) for a given p. Moghaddam et al. (2006b) 
used greedy search and branch-and-bound methods to solve small instances of problem ( 1 ) exactly and get 
good solutions for larger ones. Each step of this greedy algorithm has complexity 0(n 3 ), leading to a total 
complexity of 0(n 4 ) for a full set of solutions. Moghaddam et al. (2007) improve this bound in the re- 
gression/discrimination case. Journee et al. (2008) use an extension of the power method to (locally) solve 
the problem defined here, as well as the "block" problem of finding several sparse principal components at 
once. Loss of orthogonality means that there is no natural method for deflating the matrix once a sparse 
principal component is found and Mackey (2009) discusses several options, comparing the variance vs. 
orthogonality/sparsity tradeoffs they imply. Finally, Amini and Wainwright (2009) derive explicit sample 
size thresholds for recovery of true sparse vector using either simple thresholding methods or semidefinite 
relaxations, in a spiked model for the covariance. 

Here, we detail two semidefinite relaxations for sparse PCA, and describe algorithms to solve the relax- 
ations efficiently. We also test these techniques on various data sets: newsgroup data, Senate voting records 
and stock market returns. 

Notation 

1 /2 

For a vector z G R, we let ||z||i = Y17=i Nl an< ^ IMI = \%27=i Z T) > Card(z) is the cardinality of z, i.e. 
the number of nonzero coefficients of z, while the support / of z is the set {i : z% ^ 0} and we use I c to 
denote its complement. For f3 G R, we write (3 + = max{/3, 0} and for X G S n (the set of symmetric matrix 
of size n x n) with eigenvalues Aj, Tr(X) + = Y^l=\ max{Aj, 0}. The vector of all ones is written 1, while 
the identity matrix is written I. The diagonal matrix with the vector u on the diagonal is written diag(u). 
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2 Semidefinite relaxations 



Let E G S n be a symmetric matrix. We consider the following sparse PCA problem 

(p(p) = max z T T,z — /oCard(z) (2) 

||z||<l 

in the variable z G R n where p > is a parameter controlling sparsity. We assume without loss of generality 
that E G S n is positive semidefinite and that the n variables are ordered by decreasing marginal variances, 
i.e. that Sn > . . . > £ nn . We also assume that we are given a square root A of the matrix £ with 
£ = A T A, where A G R nxn and we denote by oi, . . . , a n G R n the columns of A. Note that the problem 
and our algorithms are invariant by permutations of £ and by the choice of square root A. In practice, we 
are very often given the data matrix A instead of the co variance E. 

A problem that is directly related to (2) is that of computing a cardinality constrained maximum eigen- 
value, by solving 

maximize z T T,z 

subject to Card(z) < k (3) 

in the variable z G R n . Of course, this problem and (2) are related. By weak duality, an upper bound on the 
optimal value of (3) is given by 

inf 4>(p) + pk. 

where P is the set of penalty values for which 4>(p) has been computed. This means in particular that if a 
point z is provably optimal for (2), it is also globally optimum for (3) with k = Card(z). 



2.1 A semidefinite relaxation with £i penalization 

Here, we briefly recall the l\ based relaxation derived in d' Aspremont et al. (2007). Following the lifting pro- 
cedure for semidefinite relaxation described in Lovasz and Schrijver (1991), Alizadeh (1995), Lemarechal 
and Oustry (1999) for example, we rewrite (3) as 

maximize Tr(EX) 
subject to Tr(X) = 1 

Card(X) < k 2 (4) 

X y 0, Rank(X) = 1, 

in the (matrix) variable X G S n . Programs (3) and (4) are equivalent, indeed if X is a solution to the above 
problem, then X y and Rank(X) = 1 mean that we have X = xx T , while Tr(X) = 1 implies that 
\\x\\2 = 1. Finally, if X = xx T then Card(X) < k 2 is equivalent to Card(x) < k. We have made some 
progress by turning the convex maximization objective x T T,x and the nonconvex constraint ||x||2 = 1 into 
a linear constraint and linear objective. Problem (4) is, however, still nonconvex and we need to relax both 
the rank and cardinality constraints. 

Since for every u G R n , Card(n) = q implies ||u||i < ^/g||u||2, we can replace the nonconvex 
constraint Card(X) < k 2 , by a weaker but convex constraint: < k, where we exploit the property 

that ||-X"||j? = V x T x = 1 when X = xx T and Tr{X) = 1. If we drop the rank constraint, we can form a 
relaxation of (4) and (3) as 

maximize Tr(EX) 
subject to Tr(X) = 1 

1 T |X|1 < k (5) 

x y o, 

which is a semidefinite program in the variable X G S n , where k is an integer parameter controlling the 
sparsity of the solution. The optimal value of this program will be an upper bound on the optimal value 
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of the variational problem in (3). Here, the relaxation of Card(X) in 1 T |X|1 corresponds to a classic 
technique which replaces the (non-convex) cardinality or Iq norm of a vector x with its largest convex lower 
bound on the unit box: \x\, the l\ norm of x (see Fazel et al. (2001) or Donoho and Tanner (2005) for other 
applications). 

Problem (5) can be interpreted as a robust formulation of the maximum eigenvalue problem, with addi- 
tive, componentwise uncertainty in the input matrix £. We again assume A to be symmetric and positive 
semidefinite. If we consider a variation in which we penalize by the t\ norm of the matrix X instead of 
imposing a hard bound, to get 

maximize Tr(£X) - pl T \X\l 

subject to Tr(X) = 1 (6) 

x y 0, 

which is a semidefinite program in the variable X £ S n , where p > controls the magnitude of the penalty. 
We can rewrite this problem as 

max min Tr(X(E + U)) (7) 

X^0,Tr(X)=l \U l3 \<p 

in the variables X £ S n and f7 € S". This yields the following dual to (6) 

minimize A max (I! + U) 

subject to \Uij\ < p, i,j = l,...,n, 

which is a maximum eigenvalue problem with variable U £ S n . This gives a natural robustness inter- 
pretation to the relaxation in (6): it corresponds to a worst-case maximum eigenvalue computation, with 
componentwise bounded noise of intensity p imposed on the matrix coefficients. 

Finally, the KKT conditions (see (Boyd and Vandenberghe, 2004, §5.9.2)) for problems (6) and (8) are 
given by 

(S + U)X = A max (£ + U)X 
UoX = -p\X\ 

Tr(X) = 1, X y y ' 

\Uij\ < p, i,j = 1,... ,n. 

If the eigenvalue A max (£ + U) is simple (when, for example, \ ma,yi {A) is simple and p is sufficiently small), 
the first condition means that Rank(X) = 1 and the semidefinite relaxation is tight, with in particular 
Card(X) = Card(x) 2 if x is the dominant eigenvector of X. When the optimal solution X is not of 
rank one because of degeneracy (i.e. when A max (£ + U) has multiplicity strictly larger than one), we can 
truncate X as in Alizadeh (1995); Lemarechal and Oustry (1999), retaining only the dominant eigenvector 
x as an approximate solution to the original problem. In that degenerate scenario however, the dominant 
eigenvector of X is not guaranteed to be as sparse as the matrix itself. 

2.2 A semidefinite relaxation with £ penalization 

We summarize here the results in d'Aspremont et al. (2008). We begin by reformulating (2) as a rela- 
tively simple convex maximization problem. Suppose that p > Sn. Since z T T,z < £n(^" =1 |zj|) 2 and 
(£2=1 Nil) 2 < IMP Card(z) for all z £ R n , we have 

4>{p) = max|uii<i z T Y,z — p Card(z) 
< (£n -p)Card(z) 
<0, 

hence the optimal solution to (2) when p > Sn is z = 0. From now on, we assume p < Sn in which case 
the inequality ||z|| < 1 is tight. We can represent the sparsity pattern of a vector z by a vector u £ {0, l} n 
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and rewrite (2) in the equivalent form 

4>(p) = max A max (diag(n)Sdiag(u)) - pl T u 

mG{0,1}" 

= max A max (diag(«) J 4 T J 4diag(u)) — pl T u 

ue{o,i} n 

= max \ ms , x (Adiag(u)A T ) - pl T u, 

u£{0,l} n 

using the fact that diag(u) 2 = diag(u) for all variables u G {0, 1}™ and that for any matrix B, \ maiX (B T B) - 
A max ( J B J B T ). We then have 

4>(p) = max A max (Adiag(n)^ T ) - pl T u 

«e{o,i}" 

= max max x T Adiag(u)A T x — pl T u 

\\x\\=l uG{0,l} n 

n 

= max max uAiaJ x) 2 — p) . 
11x11=1 ueio,^^ 

Hence we finally get, after maximizing in u (and using max„ e { u (3v = /3+) 

4>(p) = max V((af x) 2 - p) + , (10) 
11*11=1^ 

which is a nonconvex problem in the variable x G R n . We then select variables i such that (afx) 2 — p > 0. 
Note that if T*u = ajai < p, we must have (afx) 2 < ||ai|| 2 ||x|| 2 < p hence variable i will never be part of 
the optimal subset and we can remove it. 

Because the variable x appears solely through X = xx T , we can reformulate the problem in terms of X 
only, using the fact that when ||x|| = 1, X = xx T is equivalent to Tr(X) = 1, X y and Rank(X) = 1. 
We thus rewrite (10) as 

<t>(p)= max. J27=i( a I Xa i ~ P)+ 

s.t. Tr(X) = 1, Rank(X) = 1 

x y o. 

Note that because we are maximizing a convex function A n = {X G S n : Tr(X) = 1, X y 0} which 
is convex, the solution must be an extreme point of A n (i.e. a rank one matrix), hence we can drop the 
rank constraint here. Unfortunately, X \— > (afXai — p) + , the function we are maximizing, is convex in X 
and not concave, which means that the above problem is still hard. However, we show below that on rank 
one elements of A n , it is also equal to a concave function of X, and we use this to produce a semidefinite 
relaxation of problem (2). 

Proposition 1 Let A G R nxn , p > and denote by a±, . . . , a n G R n the columns of A, an upper bound on 

<f>(p)= max. £™ =1 (afXa; - p) + 

s.t. Tr(X) = 1, X y 0, Rank(X) = 1 K ' 

can be computed by solving 

iP(p)= max. Er=iTr(^ 1/2 ^ 1/2 )+ (m 
s.t. Tr(X) = 1, X 1 0. 1 ' 

in the variables X G S n , where Bi = aiaj — pi, or also 

^{p)= max. Y^=x^{PiBi) m , 

s.t. Tr(x) = i, x y o, x y Pi y o, y ' 

which is a semidefinite program in the variables X G S n , P{ G S n . 



5 



Proof. We let X 1 / 2 be the positive square root (i.e. with nonnegative eigenvalues) of a symmetric positive 
semi-definite matrix X. fn particular, if X = xx T with ||x|| = 1, then X 1 / 2 = X = xx T , and for all j3 G R, 
f5xx T has one eigenvalue equal to (3 and n — 1 equal to 0, which implies Tr{j3xx T ) + = (3 + . We thus get 

(afXdi — p) + = Tr((afxx T ai — p)xx T ) + 
= Tr(x(x T aia[x — p)x T ) + 

= TriX 1 / 2 ai aJX 1 / 2 - P X) + = Tr{X l / 2 {a ia J - pl)X l ' 2 ) + . 

For any symmetric matrix B, the function X h- > Tr(X 1 ^ 2 BX 1 / 2 ) + is concave on the set of symmetric 
positive semidefinite matrices, because we can write it as 

Tr(X l l 2 BX l / 2 )+ = max Tr(PB) 

min Tr(YX), 

{Y>B, Y^O} 

where this last expression is a concave function of X as a pointwise minimum of affine functions. We can 
now relax the original problem into a convex optimization problem by simply dropping the rank constraint, 
to get 

</>(p) = max. YZ=i Tr(X l / 2 a ia TX 1 / 2 - P X) + 
s.t. Tr(X) =1,1^0, 

which is a convex program in X G S n . Note that because Bi has at most one nonnegative eigenvalue, we 
can replace Tr(X 1 / 2 ajaf X 1 / 2 — pX) + by A max (X 1 / 2 ajaf X 1 / 2 — pX) + in the above program. Using the 
representation of Tr(X 1 / 2 BX l / 2 ) + detailed above, problem (12) can be written as a semidefinite program 

il>(p)= max. JXiTr(P^) 

s.t. Tr(x) = i, x y o, x y Pi y o, 

in the variables X £ S n , Pi G S n , which is the desired result. ■ 

Note that we always have ip(p) > 4>{p) and when the solution to the above semidefinite program has rank 
one, ip(p) = 4>(p) and the semidefinite relaxation (13) is tight. This simple fact allows us to derive sufficient 
global optimality conditions for the original sparse PC A problem. We recall in particular the following result 
from d'Aspremont et al. (2008) which provides sufficient conditions for a particular nonzero coefficient 
pattern / to be globally optimal. The optimal solution x to (2) is then found by solving an eigenvalue 
problem on the principal submatrix of S with support /. 

Proposition 2 Let A G R nxn ; p > o, £ = A T A with a\, . . . , a n G R n the columns of A. Given a sparsity 
pattern I, setting x to be the largest eigenvector of^2 ieI aiaj , if there is a p* > such that the following 
conditions hold 

/ n \ 

max(afx) 2 < p* < mm(ajx) 2 and A max / 5^ < / ((a[x) 2 — p*), 



with the dual variables defined as 



Y - max 1 o ^ ~ p) 1 (I ~ ^^K 1 ~ xxT ) when , G jc 
l ~ { ,P (p-(aJx) 2 )j ||(I-xx> 4 || 2 ' whenieI 

and 

BiXX T Bi 

Yi = — =, , when i G i, 

x 1 BiX 
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then the sparsity pattern I is globally optimal for the sparse PCA problem (2) with p = p* and we can form 
an optimal solution z by solving the maximum eigenvalue problem 

TV 

z = argmax z nz. 

{z /c =0, ||z||=l} 

This result also provides tractable lower bounds on the optimal value of (2) whenever the solution is not 
optimal. 

3 Algorithms 

In this section, we describe algorithms for solving the semidefmite relaxations detailed above. We also 
describe greedy methods to improve the quality of these solutions. 

3.1 First-order methods 

Again, given a covariance matrix £ G S n , the DSPCA code solves a penalized formulation of problem (5), 
written as 

maximize Tr(SX) - pl T \X\l 

subject to Tr(X) = 1 (14) 

x y o, 

in the variable X G S n . The dual of this program can be written as 

minimize f(U) = A max (S + U) 
subject to | Uij | < p. 

in the variable U G S n . The algorithm in d'Aspremont et al. (2007); Nesterov (2007) regularizes the objec- 
tive f(U) in (15), replacing it by the smooth (i.e. with Lipschitz continuous gradient) uniform approximation 

f^U) = plog (Trexp((S + U)/p)) - plogn. 

Following Nesterov (1983), solving the smooth problem 

nun f,(U) 

where Q = {U G S n , \Uij\ < p}, with p = e/21og(n) then produces an e-approximate solution to 
(14). The key difference between the minimization scheme developed in Nesterov (1983) and classical 
gradient minimization methods is that it is not a descent method but achieves a complexity of 0(L/N 2 ) 
instead of 0(1/N) for gradient descent, where N is the number of iterations and L the Lipschitz constant 
of the gradient. Furthermore, this convergence rate is provably optimal for this particular class of convex 
minimization problems (see (Nesterov, 2003, Th. 2.1.13)). Thus, by sacrificing the (local) properties of 
descent directions, we improve the (global) complexity estimate by an order of magnitude. For our problem 
here, once the regularization parameter p is set, the algorithm is detailed as Algorithm 1. 

The algorithm has four main steps. Step one computes the (smooth) function value and gradient. The 
second step computes the gradient mapping, which matches the gradient step for unconstrained problems 
(see (Nesterov, 2003, p.86)). Step three and four update an estimate sequence see ((Nesterov, 2003, p.72)) of 
ffj, whose minimum can be computed explicitly and gives an increasingly tight upper bound on the minimum 
of fjj,. We now present these steps in detail for our problem (we write U for Ui and X for Xi). 
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Algorithm 1 First-Order Algorithm. 

Input: The covariance S G R nxn 5 an d a parameter p > controlling sparsity. 
1: fori = lto N do 
2: Compute ffj,(Ui) and Vf^Ui) 

3: Find Y = argmin yes (Vf^U&Y) + \L\\Ui - Y\\ 2 F 

4: Find Wi = argmin WeS + £^ =Q i±I (/„(£/,■) + (V/ M (C/,), W - [/,»} 

5: Set t^! = &Wi + ffiYi 
6: end for 
Output: A matrix U G S n . 



Step 1. The most expensive step in the algorithm is the first, the computation of / M and its gradient. The 
function value can be reliably computed as 

ffj,(U) = dmax + Atlog ( ^exp(- — J -/ilogn. 

\i=i ^ / 

where are the eigenvalues of E + U. The gradient V/ M (i7) can be computed explicitly as 

V/„(E0 := exp ((£ + £/)///) / Tr (exp ((£ + . 
which means computing the same matrix exponential. 

Step 2. This step involves a problem of the form 

arg min (Vf^U), Y) + \l\\U - Y\\ 2 F , 
where U is given. The above problem can be reduced to a Euclidean projection 

arg min \\Y — V\\f, (16) 

\\Y\\oo<l 

where V = U — L~ l V /^(U) is given. The solution is given by 

Yij = sgn(T^)min(|T4j|,l), i,j = 1, . . . ,n. 

Step 3. The third step involves solving a Euclidean projection problem similar to (16), with the solution 
V defined by 

i=0 

Stopping criterion We can stop the algorithm when the duality gap is smaller than e: 

gap fc = A max (S + U k ) - TrSX, + l T \Xi\l < e, 

where X k = V/^(Z7) is our current estimate of the dual variable. The above gap is necessarily non-negative, 
since both and Ui are feasible for the primal and dual problem, respectively. This is checked periodically, 
for example every 100 iterations. 
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Complexity Overall the algorithm requires 



01 p 



ra-^/log n 



) 



(17) 



c 



iterations d'Aspremont et al. (2007); Nesterov (2007). The main step at each iteration is computing the 
matrix exponential exp((E + f7)//i) (see Moler and Van Loan (2003) for a comprehensive survey) at a cost 
of 0(n 3 ) flops. 

3.2 Greedy methods 

We can also find good solution to problem (2), or improve existing solutions, using greedy methods. We 
first present very simple preprocessing solutions with complexity 0(n log n) and 0(n 2 ). We then recall a 
simple greedy algorithm with complexity 0(n 4 ). Finally, our first contribution in this section is to derive 
an approximate greedy algorithm that computes a full set of (approximate) solutions for problem (2), with 
complexity 0(n 3 ). 

3.2.1 Sorting and thresholding 

The simplest ranking algorithm is to sort the diagonal of the matrix E and rank the variables by variance. 
This works intuitively because the diagonal is a rough proxy for the eigenvalues: the Schur-Horn theo- 
rem states that the diagonal of a matrix majorizes its eigenvalues Horn and Johnson (1985); sorting costs 
0(n log n). Another quick solution is to compute the leading eigenvector of E and form a sparse vector by 
thresholding to zero the coefficients whose magnitude is smaller than a certain level. This can be done with 
cost 0(n 2 ). 

3.2.2 Full greedy solution 

Following Moghaddam et al. (2006b), starting from an initial solution of cardinality one at p = En, we 
can update an increasing sequence of index sets Ik C [1, n], scanning all the remaining variables to find the 
index with maximum variance contribution. 

Algorithm 2 Greedy Search Algorithm. 
Input: SeR nxn 

1: Preprocessing: sort variables by decreasing diagonal elements and permute elements of E accordingly. 
2: Compute the Cholesky decomposition E = A T A. 
3: Initialization: I\ = {1}, x\ = ai/||ai||. 
4: for i = 1 to k taT ^ ct do 

5: Compute i k = argmax^ A max fe ieJfeU{i} a j a J^j ■ 

6: Set = ifc U {ik} and compute Xk+\ as the leading eigenvector of Yljel k i a j a J- 
7: end for 
Output: Sparsity patterns Ik- 



At every step, If. represents the set of nonzero elements (or sparsity pattern) of the current point and we 
can define Zk as the solution to problem (2) given Ik, which is: 



which means that Zk is formed by padding zeros to the leading eigenvector of the submatrix E/ fe / fe . Note 
that the entire algorithm can be written in terms of a factorization E = A T A of the matrix E, which means 



Zk = argmax z T Y*z — pk 

{z r c=0, \\z\\=l} 
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significant computational savings when S is given as a Gram matrix. The matrices ^i k j k and ^2 ieI aiaj 
have the same eigenvalues and if z is an eigenvector of Sj fc 4, then Aj k z/\\Ai k z\\ is an eigenvector of 

3.2.3 Approximate greedy solution 

Computing n — k eigenvalues at each iteration is costly and we can use the fact that uu T is a subgradient of 
Amax at X if u is a leading eigenvector of X Boyd and Vandenberghe (2004), to get: 

Vjg/feUW / \jel k J 

which means that the variance is increasing by at least (x^ai) 2 when variable i is added to Ik- This provides 
a lower bound on the objective which does not require finding n — k eigenvalues at each iteration. We then 
derive the following algorithm. 

Algorithm 3 Approximate Greedy Search Algorithm. 
Input: S £ R nxn 

1 : Preprocessing: sort variables by decreasing diagonal elements and permute elements of £ accordingly. 
Compute the Cholesky decomposition S = A T A. 
Initialization: I\ = {1}, x\ = ai/||ai||. 
for i = 1 to k taT & ct do 

Compute ik = argmax^ Jfc (x^aj) 2 . 

Set Ik+i = Ik U {ik} and compute Xk+i as the leading eigenvector of i a,jaj. 

end for 
Output: Sparsity patterns 



Again, at every step, Ik represents the set of nonzero elements (or sparsity pattern) of the current point 
and we can define Zj. as the solution to problem (2) given Ik, which is: 

Zk = argmax z T T,z — pk, 

{zjc=0, \\z\\=l} 
k 

which means that Zk is formed by padding zeros to the leading eigenvector of the submatrix Ej fc j fe . Better 
points can be found by testing the variables corresponding to the p largest values of (x^a^) 2 instead of 
picking only the best one. 



3.2.4 Computational complexity 

The complexity of computing a greedy regularization path using the classic greedy algorithm in §3.2.2 is 
0(n 4 ): at each step k, it computes (n — k) maximum eigenvalue of matrices with size k. The approximate 
algorithm in §3.2.3 computes a full path in 0(n 3 ): the first Cholesky decomposition is <9(n 3 ), while the 
complexity of the fc-th iteration is 0(k 2 ) for the maximum eigenvalue problem and 0(n 2 ) for computing 
all products (x T cij). Also, when the matrix S is directly given as a Gram matrix A T A with A £ R'? xri with 
q < n, it is advantageous to use A directly as the square root of S and the total complexity of getting the 
path up to cardinality p is then reduced to 0(p 3 + p 2 n) (which is 0(p 3 ) for the eigenvalue problems and 
0(p 2 n) for computing the vector products). 
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4 Applications 



In this section, we illustrate the sparse PCA approach in applications: in news (text), finance and voting data 
from the US Senate. 

4.1 News data 

Our first dataset is a small version of the "20-newsgroups" data 1 . The data records binary occurrences of 
100 specific words across 16242 postings, where the postings have been tagged by the highest level domain 
in Usenet. 




2ndP.C. IstP.C. 

Number of Principal Comp. 

Figure 1: PCA with 20-Newsgroups data. Left: Explained variance vs. number of PCs. Right: 3D 
visualization via PCA. 

Each posting is viewed as one point in a 100-dimensional space. We begin with a standard PCA on 
the data. Fig. 1 (left) shows the cumulative percentage of variance explained as we increase the number of 
principal components. The slow increase means that the data does not lie within a subspace of significantly 
low dimension. We can anyway proceed to visualize the data: Fig. 1 (right) is the result obtained by 
projecting it on a subspace of dimension 3, chosen by selecting the eigenvectors corresponding to the three 
largest eigenvalues of the covariance matrix. Since these 3 vectors are dense, the axes in Fig. 1 (right) do 
not have a clear interpretation. 

With sparse PCA, we hope to find a set of corresponding sparse principal components, which still help 
with visualization nearly as well as PCA does, and yet reveal some interesting structure. To achieve this, 
we have run the first-order algorithm of Section 3. 1 (referred to as "DSPCA" hereafter) on the data with a 
range of values for the penalty parameter p. We obtained a plot of the variance explained by the first sparse 
principal component (PC), as a function of its cardinality (Fig. 2). We then selected a cardinality that can 
explain at least 90% of the variance explained by the the first principal component obtained from PCA. Then 
we have deflated the covariance matrix by taking out the part due to the first sparse PC, and then repeated 
the above procedure to obtain the second sparse PC. In the same way, we have solved for the third sparse 
PC. Fig. 2 also shows the projection of the data on the 3-dimensional subspace that is spanned by the three 
sparse PCs obtained above. 

'available from http : / / cs . nyu . edu/ ~ roweis/data .html. 
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Figure 2: Sparse PCA on the 20Newsgroups data set. First three principal components and 3D 
visualization. The first three principal components have cardinalities 26, 30 and 10 respectively. 



We first note that only a small number of words, out of the total of 100 words that can appear in each 
sparse PC, can explain more than 90% of variance explained by the corresponding PC. Specifically, we 
obtain 30 words for the first PC, 26 for the second, and 10 for the third. The lists of words associated with 
each sparse PCs is given in Table 1, and reveals some structure about each one of the sparse PCs. That is, 
the 30 words associated with the first sparse PC are almost all about politics and religion, the 26 words in the 
second sparse PC are all computer-related, and the majority of the 10 words in the third sparse PC concerns 
science. Hence, applying sparse PCA to this data set allows to discover structure that is otherwise hidden in 
the standard PCA, for example that the first principal component is mainly related to politics and religion. 

We also run the thresholded PCA and DSPCA algorithms over the collection of 1,288 news articles 
published by the New York Times'?, International section mentioning the word "China." We tokenize the 
articles by unigrams, remove no stop words, and perform no stemming. The data encodes the binary {0, 1} 
values (corresponding to appearance/non-appearance) of 86500 tokens. 

Figure 3 shows the percentage of explained variance as a function of cardinality. Here we see DSPCA 
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Table 1: Words associated with the first three sparse PCs. 



does outperform Thresholded PCA, though not by a big margin. Although we do not have ground truth, 
Table 2 and Table 3 contains words selected by two algorithms respectively as we increase cardinality. 
Words selected by DSPCA appear much more meaningful than those chosen by thresholded PCA at the 
same cardinality. 

4.2 Senate voting data 

In this section, we analyze the voting records of the 109th US Senate (2000-2004) There were 101 senators 
(one extra Senator is due to a replacement during the term) and 48 bills involved. To simplify, the votes are 
divided into yes (coded as 1) or no (coded as -1), and other votes are coded as 0. 

Each senator's voting record can be viewed as a point in a 48-dimensional space. By applying PCA, 
and projecting each senator's voting record onto a two-dimensional subspace of maximum variance, we 
can see that senators are almost perfectly separated by partisanship (Fig. 4). However, since the principal 
components involve all the bills, it is hard to tell which bills are most responsible for the explained variance. 
By applying Sparse PCA to the voting record, we aim to find a few bills that not only divide the senators 
according to partisanship, but also reveal which topics are most controversial within the Republican and 
Democratic parties. Fig. 4 (right) shows the senators' voting records, projected onto the first two sparse 
principal components. We note that in the two-dimensional space senators are still divided by partisanship. 
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In fact, many republican senators perfectly coincide with each other and so are democratic senators. In 
contrast to Fig. 4 (left), the cardinalities associated with the first and second sparse principal components 
are 5 and 2 respectively, which makes it possible to interpret the coordinates. 



▲ 


• Democrats 
a Republicans 


• • *. 


• 




^ * 
* 

• 

• 


* ▲ 
• ▲ 




▲ 



-2 2 4 

1st P.C. (Card = 48) 



cs 
n 

U 



— 



1.5 
1 

0.5 




u -° 5 

PL, 



-1 

-1.5 

-2 



▲ ▲ A 



• Democrats 
a Republicans 



1st P.C. (Card = 5) 



Figure 4: 109th Senate's voting record projected onto the top 2 principal components. 

Let us examine the bills appearing in the first two sparse principal components. For the first sparse PC, 
the corresponding bills' brief description is as follows: 

• S. 1932, As Amended; Deficit Reduction Act of 2005. 

• S. Con. Res. 83; An original concurrent resolution setting forth the congressional budget for the United 
States Government for fiscal year 2007 and including the appropriate budgetary levels for fiscal years 
2006 and 2008 through 201 1. 

• S. 3930, As Amended; Military Commissions Act of 2006. 

• S. 403, As Amended; Child Interstate Abortion Notification Act. 
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Table 2: 1st PC from DSPCA on 1,288 New York Times articles mentioning the word "China" for 
various values of the eigenvector cardinality fc. 
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Table 3: 1st PC from Thresholded PCA on 1,288 New York Times articles mentioning the word 
"China" for various values of the eigenvector cardinality fc. 



• Passage of S. 397, As Amended; Protection of Lawful Commerce in Arms Act. 
The brief description for the two bills in the second sparse principal component are: 

• H. R. 3045; Dominican Republic-Central America-United States Free Trade Agreement Implementa- 
tion Act. 

• S. 1307; Dominican Republic-Central America-United States Free Trade Agreement Implementation 
Act. 

A glance at these bills tells us that the major controversial issues between Democrats and Republicans are 
topics such as "abortion", "military", "budget", and "free trade". 

Fig 5 plots the variance explained by the first sparse principal component divided by that explained by 
the first PC, as a function of the cardinality of the sparse PC. Fig. 5 also shows how the cardinality of the 
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first sparse PC varies as the penalty parameter p is changed in the DSPCA code. We can see that when 19 
out of 48 variables (bills) are used, sparse PCA almost achieves the same statistical fidelity as standard PCA 
does. 




10 20 30 40 50 0.2 0.4 0.6 0.8 1 

P 

Cardinality 



Figure 5: Left: Explained variance as a function of cardinality. Right: Cardinality as a function of 
penalty parameter p. 



4.3 Stock market data 

In this section, we investigate the historical prices of S&P500 stocks over 5 years, from June 1st, 2005, 
through June 1st, 2010. By taking out the stocks with less than 5 years of history, we end up with 472 
stocks, each having daily closing prices over 1259 trading days. The prices are first adjusted for dividends 
and splits and then used to calculate daily log returns. Each day's return can be represented as a point in 
R 472 . 




Cardinality 



Figure 6: Left: Explained variance as a function of cardinality. Right: Cardinality as a function of 
penalty parameter p. 
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Fig. 6 shows the explained variance as a function of 1st PC's cardinality. It seems hard to say that the 
1st PC is sparse, since there is no natural "kink" in that curve. That is, we need almost 300 out of the total 
472 stocks to explain at least 90% of the variance explained by the 1st PC from PC A. However, when we 
inspect the sparse PCs with increasing cardinalities, we note that initially only stocks from the "Financials" 
sector come to play and later until, at cardinality 32, do we see companies from other sectors appearing in 
the 1st sparse PC. So we take the first sparse PC with cardinality equal to 32. Then we solve for the 2nd 
sparse PC, and using the same guideline to arrive at a cardinality of 26. 
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» * . . 



5 -1 -0.5 0.5 1 1.5 
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Figure 7: S&P500 daily returns projected onto the top 2 principal components. For PCA (left) and 
sparse PCA (right). 

Figure 7 show the stock returns projected onto the 2-dimensional subspaces spanned by the top 2 PCs 
and top 2 sparse PCs, respectively. Comparing these two plots, we observe two interesting phenomena: 

• Although the top 2 principal components from PCA explain more variance (as seen from the larger 
range of the axes in the left over the right panel), the two sparse principal components from DSPCA 
involve only 58 out of 472 stocks (32 on the first PC and another distinct 26 on the second). Further- 
more, 31 of the 32 stocks in the first sparse PC are all from the sector "Financials", and that almost all 
26 stocks in the second sparse PC come from "Energy" and "Materials" except 2 from "Financials" 
and 1 from "Information Technology". Considering that there are 10 sectors in total, this is quite 
interesting as Sparse PCA is able to identify the right groups (industry factors) that explains most of 
the variance. Our data covers June 2005 through June 2010 where a severe financial crisis took place, 
and the key role of the Financial sector is revealed purely through our sparse PCA analysis. 

• In Fig. 6, the projected data appears symmetrically distributed around its center. In contrast, In Fig. 7 
(right), we observe a definite orientation. Since the horizontal axis (first PC) corresponds to "Finan- 
cials" and the vertical one to "Energy" and "Materials", the sparse PCA analysis tells us that these 
two sectors are positively correlated. 

4.4 Random matrices 

Sparse eigenvalues of random matrices play a central role in characterizing the performance of l\ decoders in 
compressed sensing applications. Testing the Restricted Isometry Property (RIP) in Candes and Tao (2005) 
amounts to bounding the maximum and minimum eigenvalues of a Gram matrix. Here, we compute the 
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Cardinality Cardinality 



Figure 8: Upper and lower bound on sparse maximum eigenvalues. We plot the maximum sparse 
eigenvalue versus cardinality, obtained using exhaustive search (solid line), the approximate greedy 
(dotted line) and fully greedy (dashed line) algorithms. We also plot the upper bounds obtained 
by minimizing the gap of a rank one solution (squares), by solving the £q semidefinite relaxation 
explicitly (stars) and by solving the DSPCA dual l\ relaxation (diamonds). Left: On a matrix F T F 
with F Gaussian. Right: On a sparse rank one plus noise matrix. 



upper and lower bounds on sparse eigenvalues produced using various algorithms. We pick the data matrix 
to be small enough so that computing sparse eigenvalues by exhaustive search is numerically feasible. In 
Figure 8, we plot the maximum sparse eigenvalue versus cardinality, obtained using exhaustive search (solid 
line), the approximate greedy (dotted line) and fully greedy (dashed line) algorithms. We also plot the 
upper bounds obtained by minimizing the gap of a rank one solution (squares), by solving the semidefinite 
relaxation in §2.2 explicitly (stars) and by solving the DSPCA dual (diamonds). On the left, we use a matrix 
£ = F T F with F Gaussian. On the right, £ = mm t /||u|| 2 + 2V T V, where itj = i = 1, . . . ,n and 
V is matrix with coefficients uniformly distributed in [0, 1]. Almost all algorithms are provably optimal in 
the noisy rank one case (as well as in many example arising from "natural data"), while Gaussian random 
matrices are harder. Note however, that the duality gap between the semidefinite relaxations and the optimal 
solution is very small in both cases, while our bounds based on greedy solutions are not as good. Overall, 
while all algorithms seem to behave similarly on "natural" or easy data sets, only numerically expensive 
relaxations produce good bounds on the random matrices used in compressed sensing applications. 

4.5 Statistical consistency vs. computational complexity 

As we hinted above, very simple methods such as thresholding or greedy algorithms often perform well 
enough on simple data sets, while obtaining good statistical fidelity on more complex (or random) data sets 
requires more complex algorithms. This is perfectly illustrated by the results in Amini and Wainwright 
(2009) on a spiked covariance model. To summarize these results, suppose that the sample covariance 
matrix S G S„ is a noisy estimate of the true population covariance S G S n with £ = £ + A where A 
is a noise matrix, suppose also that the leading eigenvector of the true covariance is sparse with cardinality 
k. Under some assumptions on the noise component A, Amini and Wainwright (2009) show that when 
the ambient dimension n, the number of observations m and the number k of nonzero components in the 
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leading eigenvector all scale to infinity, and when the ratio 



thrcs k 2 log(n — k) 

is above some critical value, then simply thresholding the diagonal of the sample covariance matrix will 
recover the exact support of the leading eigenvector of £ with probability tending to one. On the other hand, 
simple thresholding fails with probability one when this ratio is below a certain value. Furthermore, Amini 
and Wainwright (2009) show that when 



s p k log(n — k) 

is above some critical value, the solution of the semidefinite relaxation in Section 2.1 (if tight) will recover 
the exact support of the leading eigenvector of £ with probability tending to one. On the other hand, the 
semidefinite relaxation fails with probability one when this ratio is below a certain value. They also show that 
the semidefinite programing relaxation in Section 2. 1 is statistically optimal, meaning that no other method 
(even combinatorial ones) can recover the true support using fewer samples (up to a constant factor). This 
result clearly illustrates a tradeoff between statistical fidelity on one side and computational complexity 
on the other. In the spiked model, the semidefinite relaxation requires 0{l/k) fewer samples than simply 
thresholding the diagonal to recover the true support of the leading eigenvector of £, but its complexity is 
much higher than that of the thresholding strategy. 

We can further illustrate this behavior on a simple numerical example. Suppose we are given a sample 
covariance £ G S n coming from a "spiked" model of covariance similar to that in Amini and Wainwright 
(2009), with 

t = uu T + VV T /V^ 

where u G R™ is the true sparse leading eigenvector, with Card(u) = k, V G R™ xm j s a noise matrix 
with Vij ~ A/"(0, 1) and m is the number of observations. We compare the performance of the simple 
thresholding method (on the leading eigenvector of regular PCA here) with that of the semidefinite relaxation 
when recovering the support of u for various values of the number of samples. Our point here is that, while 
variance versus cardinality is a direct way of comparing the performance of sparse PCA algorithms, accurate 
recovery of the support is often a far more important objective. Many methods produce similar variance 
levels given a limited budget of nonzero components, but their performance in recovering the true support 
is often markedly different. 

In Figure 9 on the left we compare ROC curves when recovering the support of u in the spiked model 
above using thresholded PCA, the approximate and full greedy algorithms in d'Aspremont et al. (2008) and 
semidefinite relaxation (DSPCA). On the right, we plot Area Under ROC as the number of samples increase. 
As expected, we observe that the semidefinite relaxation performs much better when only a limited number 
of observations are available (m small). 



5 Conclusion 

We have reviewed here several techniques for approximating the solution to the single factor sparse PCA 
problem. While the algorithms presented here perform quite well, several key questions remain open at this 
point. 

First, outside of the (locally) convergent algorithm in Journee et al. (2008), very few methods handle the 
problem of simultaneously finding several leading sparse principal components. 

Also, as the examples of Section 4 illustrate, most methods (even extremely simple ones) perform well 
enough on easy, "natural" data sets while only the most expensive semidefinite relaxations seem to produce 
good bounds on the random matrices used in compressed sensing applications for example. Characterizing 
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Specificity Number of samples m 

Figure 9: Left: ROC curves when recovering the support of u in the spiked model using thresh- 
olding, approximate and exact greedy algorithms and the semidefinite relaxation (DSPCA) in Sec- 
tion 2.1 in the spiked model when n = 250, m = 400 and k = 25. Right: Area Under ROC 
(AUROC) versus number of samples m. 



what makes "natural" data sets easier than random ones remains an open problem at this point. It is also 
not clear yet how to extend the statistical optimality statements of Amini and Wainwright (2009) to broader 
(e.g. deterministic) classes of matrices. 

Finally, the question of approximation bounds a la MAXCUT for the relaxations detailed here is largely 
open. Basic performance bounds are discussed in d'Aspremont and El Ghaoui (2008); Bach et al. (2010) 
but they can certainly be tightened. 
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